Canonical Layer
The canonical layer creates a single source of truth for your business entities. It harmonizes data from multiple sources into unified, reliable datasets that represent core business concepts.
Example Folder Structure
models
└── canonical
├── orders
│ ├── orders.sql
│ └── orders.yml
├── customers
│ ├── customers.sql
│ └── customers.yml
└── products
├ ── products.sql
└── products.yml
Structure Best Practices
- ✅ Entity-Based Structure ~ Group by business entity (orders, customers, etc.)
- ✅ File Naming ~ Use clear, simple entity names (customers, products, etc.)
- ✅ Single Source of Truth ~ One canonical model per business entity
Purpose
Canonical models serve to:
- Unify data from multiple sources
- Resolve conflicts between sources
- Establish standard definitions
- Create consistent business entities
Example Scenarios
| Scenario | Purpose | Example |
|---|---|---|
| Multiple Sources | Combine order data from web and retail systems | Orders from Shopify and POS |
| Conflicting Data | Resolve different customer details across systems | Customer details from CRM and ecommerce |
| Data Standards | Establish consistent product categorization | Unified product hierarchy |
Example Model
-- models/canonical/customers/customers.sql
WITH crm_customers AS (
SELECT * FROM {{ ref('stg_salesforce__customers') }}
),
web_customers AS (
SELECT * FROM {{ ref('stg_shopify__customers') }}
),
unified AS (
SELECT
COALESCE(crm.customer_id, web.customer_id) as customer_id,
-- Prefer CRM data over web data
COALESCE(crm.email, web.email) as email,
COALESCE(crm.name, web.name) as customer_name,
crm.customer_segment,
web.first_order_date,
-- Additional fields and logic
FROM crm_customers crm
FULL OUTER JOIN web_customers web
ON crm.email = web.email
)
SELECT * FROM unified
Materialization
Canonical models should be materialized as tables for optimal performance:
# dbt_project.yml
models:
your_project:
canonical:
+materialized: table
Why Tables?
Frequently accessed by downstream models Complex transformations best stored as tables Serve as stable reference points Better query performance
Documentation and Testing
Thorough documentation and testing are crucial for canonical models:
# models/canonical/customers/customers.yml
version: 2
models:
- name: customers
description: Single source of truth for customer data
columns:
- name: customer_id
description: Unique identifier for customers
tests:
- unique
- not_null
- name: email
description: Primary customer email
tests:
- not_null
- unique
tests:
- dbt_utils.equal_rowcount:
compare_model: ref('stg_salesforce__customers')
Best Practices
Data Quality
- Implement comprehensive testing
- Monitor data freshness
- Track source system changes
- Document business rules
Performance
- Index key columns
- Partition large tables
- Optimize join conditions
- Schedule appropriate refresh times
Governance
- Document data lineage
- Define clear ownership
- Maintain version control
- Monitor usage patterns
Common Pitfalls
- Creating canonical models without clear business need
- Insufficient testing of source data integration
- Poor documentation of business rules
- Lack of stakeholder alignment
Success Criteria
A good canonical model should:
- Have clear, unique business entity definition
- Resolve conflicts consistently
- Be well-documented and tested
- Serve as a reliable reference
- Maintain data quality standards